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Abstract 

Big data comes in various ways, types, shapes, forms and sizes. Indeed, almost all areas of science, 
technology, medicine, public health, economics, business, linguistics and social science are bombarded by 
ever increasing flows of data begging to analyzed efficiently and effectively. In this paper, we propose 
a rough idea of a possible taxonomy of big data, along with some of the most commonly used tools for 
handling each particular category of bigness. The dimensionality p of the input space and the sample 
size n are usually the main ingredients in the characterization of data bigness. The specific statistical 
machine learning technique used to handle a particular big data set will depend on which category it 
falls in ivithin the bigness taxonomy. Large p small n data sets for instance require a different set of 
tools from the large n small p variety. Among other tools, we discuss Preprocessing, Standardization, 
Imputation, Projection, Regularization, Penalization, Compression, Reduction, Selection, Kernelization, 
Hybridization, Parallelization, Aggregation, Randomization, Replication, Sequentialization. Indeed, it is 
important to emphasize right away that the so-called no free lunch theorem applies here, in the sense that 
there is no universally superior method that outperforms all other methods on all categories of bigness. It 
is also important to stress the fact that simplicity in the sense of Ockham's razor non plurality principle 
of parsimony tends to reign supreme when it comes to massive data. We conclude with a comparison of 
the predictive performance of some of the most commonly used methods on a few data sets. 

Keywords: Massive Data, Taxonomy, Parsimony, Sparsity, Regularization, Penalization, Compression, 
Reduction, Selection, Kernelization, Hybridization, Parallelization, Aggregation, Randomization, Sequen¬ 
tialization, Cross Validation, Subsampling, Bias-Variance Trade-off, Generalization, Prediction Error. 

I. Introduction 

We consider a dataset T> — {(x\,y\), (x.i,y 2 ), , (x„,y n )}, where = (xu,x; 2 , ■ ■ ■ , x !p ) denotes 

the p-dimensional vector of characteristics of the input space X, and i/, represents the correspond¬ 
ing categorical response value from the output space y = {1, ■ ■ ■ ,g}. Typically, one of the most 
basic ingredients in statistical data mining is the data matrix X given by 
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Five aspects of the matrix X that are crucial to a taxonomy of massive data include: (i) The 
dimension p of the input space X, which simply represents the number of explanatory variables 
measured; (ii) The sample size n, which represents the number of observations (sites) at which 
the variables were measured/collected; (iii) The relationship between n and p, namely the ratio 


1 





n/p; (iv) The type of variables measured (categorical, ordinal, interval, count or real valued), and 
the indication of the scales/units of measurement; (v) The relationships among the columns of 
X, namely whether or not the columns are correlated (nearly linearly dependent). Indeed, as we 
will make clear later, massive data, also known as big data, come in various ways, types, shapes, 
forms and sizes. Different scenarios of massive data call upon tools and methods that can be 
drastically different at times. The rest of this paper is organized as follows: Section 2 presents 
our suggested taxonomy for massive data based on a wide variety of scenarios. Section 3 presents 
a summary of the fundamental statistical learning theory along with some of the most commonly 
used statistical learning methods and their application in the context of massive data; Section 4 
presents a comparison of predictive performances of some popular statistical learning methods 
on a variety of massive data sets. Section 5 presents our discussion and conclusion, along with 
some of the ideas we are planning to explore along the lines of the present paper. 


II. On the Diversity of Massive Data Sets 

Categorization of massiveness as a function of the input space dimensionality p 

Our idea of a basic ingredient for a taxonomy for massive data comes from a simple reasoning. 
Consider the traditional multiple linear regression (MLR) setting with p predictor variables un¬ 
der the Gaussian noise. In a typical model space search needed in variable selection, the best 
subsets approach fits 2P — 1 models and submodels. If p = 20, the space of linear models is 
of size 1 million. Yes indeed, one theoretically has to search a space of 1 million models when 
p — 20. Now, if we have p = 30, the size of that space goes up to 1 billion, and if p — 40, the 
size of the model space goes up to 1 trillion, and so on. Our simple rule is that any problem with an 
input of more than 50 variables is a massive data problem, because computationally searching a thousand 
trillion is clearly a huge/massive task for modern day computers. Clearly, those who earn their keep 
analyzing inordinately large input spaces like the ones inherent in microarray data (p for such 
data is in the thousands) will find this taxonomy somewhat naive, but it makes sense to us based 
on the computational insights underlying it. Besides, if the problem at hand requires the estima¬ 
tion of covariance matrices and their inverses through many iterations, the 0(p 3 ) computational 
complexity of matrix inversion would then require roughly 125000 complexity at every single 
iteration which could be quickly untenable computationally. Now, it's clear that no one in their 
right mind decides to exhaustively search a space of a thousand trillion models. However, this 
threshold gives us somewhat of a point to operate from. From now on, any problem with more 
than 50 predictor variables will be a big data problem, and any problem with p exceeding 100 
will be referred to as a massive data problem. 


Categorization of massiveness as a function of the sample size n 

When it comes to ideas about determining how many observations one needs, common sense 
will have it that the more the merrier. After all, the more observations we have to close we are to 
the law of large numbers, and indeed, as the sample size grows, so does the precision of our es¬ 
timation. However, some important machine learning methods like Gaussian Process Classifiers, 
Gaussian Process Regression Estimators, the Relevance Vector Machine (RVM), Support Vector 
Machine (SVM) and just about all other kernel methods operate in dual space and are therefore 
heavily dependent on the sample size n. The computational and statistical complexity of such 
methods is driven by the same size n. Some of these methods like Gaussian Processes and the 
Relevance Vector Machine require the inversion of n x n matrices. As a result, such methods 
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could easily be computationally bogged down by too large a sample size n. Now, how large is 
too large? Well, it takes 0(n 3 ) operations to invert an n x n matrix. Anyone who works with 
matrices quickly realizes that with modern-day computers, messing around with more that a 
few hundreds in matrix inversion is not very smart. These methods can become excruciatingly 
(unpractically) slow or even unusable when n gets ever larger. For the purposes of our categoriza¬ 
tion, we set the cut-off at 1000 and define as observation-massive any dataset that have n > 1000. 
Again, we derive this categorization based on our observations on the computational complexity 
of matrix inversion and its impact on some of the state-of-the-art data mining techniques. 


Categorization of massiveness as a function of the ratio n/p 

From the previous argumentation, we could say that when p > 50 or n > 1000, we are compu¬ 
tationally in the presence of massive data. It turns out however that the ratio n/p is even more 
important to massive and learnability than n and p taken separately. From experience, it's our 
view that for each explanatory variable under study, a minimum of 10 observations is needed to 
have a decent analysis from both accuracy and precision perspectives. Put in simple terms, the 
number of rows must be at least 10 times the number of columns, specifically n > 10 p. Using this 
simple idea and the fact that information is an increasing function of n, we suggest the following 
taxonomy as a continuum of n/p. 



n v< 1 

Information 

Poverty 

(n <SC p) 

1 < 2 < 10 

Information 

Scarcity 

f >io 

Information 

Abundance 

(n p) 

n > 1000 

Large p, Large n 

A 

Smaller p, Large n 

B 

Much smaller p, Large n 

C 

n < 1000 

Large p, Smaller n 

Smaller p, Smaller n 

Much smaller p, Small n 


D 

E 

F 


Table 1: In this taxonomy, A and D pose a lot of challenges. 


III. Methods and Tools for Handling Massive Data 

Batch data vs incremental data production 

When it comes to the way in which the data is acquired or gathered, the traditionally assumed 
way is the so-called batch, where all the data needed is available all at once. In state-of-the- 
art data mining however, there are multiple scenarios where the data is produced/delivered in 
a sequential/incremental manner. This has prompted the surge in the so-called online learning 
methods. As a matter of fact, the perceptron learning rule, arguably the first algorithm that 
launched the whole field of machine learning, is an online learning algorithm. Online algorithms 
have the distinct advantage that the data does not have to be stored in memory. All that is 
required in the storage of the built model at time t. In the sense the stored model is assumed to 
have accumulated the structure of the underlying model. Because of that distinct feature, one may 
think of using online algorithms even when the whole data available. Indeed, when the sample 
size n is so large that the data cannot fit in the computer memory, one can consider building a 
learning method that receives the data sequentially / incrementally rather than trying to load the 
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whole data set into memory. We shall refer to this aspect of massive data as sequentialization or 
incrementalization. Sequentialization is therefore useful for both streaming data and massive data 
that is too large to be loaded into memory all at once. 


Missing Values and Imputation Schemes 

In most scenarios of massive data analytics, it is very common to be faced with missing values. 
The literature on missing values is very large, and we will herein simply mention very general 
guidelines. One of the first thing one needs to consider with missing values is whether they 
are missing systematically or missing at random. The second important aspect is the rate of 
missingness. Clearly, when we have abundance of data, the number of missing values is viewed 
differently. Three approaches are often used to address missingness: (a) Deletion, which consists 
of deleting all the rows that contain any missingness; (b) Central imputation, which consists of 
filling the missing cells of the data matrix with central tendencies like mode, median or mean; (c) 
model-based imputation using various adaptation of the ubiquitous Expectation-Maximization 
(EM) algorithm. 


Inherent lack of structure and importance of preprocessing 

Sentiment analysis based on social media data from facebook and twitter, topic modelling based 
on a wide variety of textual data, classification of tourist documents or even to be more general 
the whole field of text mining and text categorization require the manipulation of inherently 
unstructured data. All these machine learning problems are of great interest to end-users, statis¬ 
tical machine learning practitioners and theorists, but cannot be solve without sometimes huge 
amounts of extensive pre-processing. The analysis of a text corpus for instance never starts with 
a data matrix like the X defined in Equation {!]). With these inherently unstructured data like text 
data, the pre-processing often leads to data matrices whose entries are frequencies of terms. It's 
important to mention that term frequency matrices tend to contain many zeroes, because a term 
deemed important for a handful of documents will tend not to appear in many other documents. 
This content-sparsity can be a source of a variety of modelling problems. 


Homogeneous vs Heterogeneous input space 

There are indeed many scenarios of massive data where the input space is homogeneous, i.e. where 
all the variables are of the same type. Audio processing, image processing and video processing 
all belong to a class of massive data where all the variables are of the same type. There are 
however many other massive data scenarios where the input space is made up of variables of 
various different types. Such heterogeneous input spaces arise in fields like business, marketing, 
social sciences, psychology, etc ... where one can have categorical, ordinal, interval, count, and 
real valued variables gathered on the same entity. Such scenarios call for hybridization, which 
may take the form of combining two or more data-type-specific methods in order to handle the 
heterogeneity of the input space. In kernel methods for instance, if one has both textual inputs 
and real valued inputs, then one could simply use a kernel /C = txK,\ + (1 — oftCz that is the 
convex combination of two data-type-specific kernels, namely a string kernel K, \ and a real valued 
kernel K- 2 - Hybridization can also be used directly in modelling through the use of combination 
of models. 
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Difference in measurement scale and the importance of transformation 

Even when the input space is homogeneous, it is almost always the case that the variables are 
measured on different scales. This difference in scales can be the source of many modelling 
difficulties. A simple way to address this scale heterogeneity is to perform straightforward trans¬ 
formations that project all the variables onto the same scale. 

Standardization: The most commonly used transformation is standardization which leads to all the 
variables having zero mean and unit variance. Indeed, if Xj is one of the variables in X and we 
have n observations Xy, X2j, ■ ■ ■ , X n j, then the standardized version of X,y is 



x a~ x i 


Ef=i (Xij-Xjf 


n 

where nXj = Yf ^ij 
j=i 


Unitization: is another form of transformation commonly used. Unitization simply consists of 
transformation the variables such that all take values in the unit interval [0,1]. The resulting 
input space is therefore the unit p-dimensional hypercube, namely [0, l] p . With unitization, if Xj is 
one of the variables in X and we have n observations Xy, X2j, ■ ■ ■ , X n j, then the unitized version 
of Xjj is given by 



Xjj — min Xj 
max Xj — min Xj 


Dimensionality reduction and feature extraction 

Learning, especially statistical machine learning, is synonymous with dimensionality reduction. 
Indeed, after data is gathered, especially massive data, nothing can be garnered in terms of in¬ 
sights until some dimensionality reduction is performed to provide meaningful summaries reveal¬ 
ing the patterns underlying the data. Typically, when people speak of dimensionality reduction, 
they have in mind the determination of some intrinsic dimensionality q of the input space, where 
q XSi p. There are many motivations for dimensionality reduction (a) achieve orthogonality in 
the input space (b) eliminate redundant and noise variables, and as a result perform the learning 
in a lower dimensional and orthogonal input space with the benefit of variance reduction in the 
estimator. In practice, lossy data compression techniques like principal component analysis (PCA) 
and singular value decomposition (SVD) are the methods of choice for dimensionality reduction. 
However, when n <SC p, most of these techniques cannot be directly used in their generic forms. 


Kernelization and the Power of Mapping to Feature Spaces 

In some applications, like signal processing, it is always the case that n p in time domain. A 
ten seconds audio track at a 44100Hz sampling rate, generates a vector of dimension p — 441000 
in time domain, and one typically has only few hundreds or maybe a thousand tracks for the 
whole analysis. Typically image processing problems are similar in terms of dimensionality 
with a simple face of size 640 x 512 generating a p = 327680 dimensional input space. In both 
these cases, it's impossible to use basic PCA or SVD, because n p, making it impossible to 
estimate the covariance structure needed in eigenvalue decomposition. One of the solution to 
this problem is the use of the methods that operate in dual space, like kernel methods. In recent 
years, kernelization has been widely applied and with tremendous success to PCA, Canonical 
Correlation Analysis (CCA), Regression, Logistic Regression, k-Means clustering just to name a 
few. Given a dataset with n input vectors x, G X from some p dimensional space, the main 
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ingredient in kernelization is a bivariate function /C( , •) defined on X x X and with values in R, 
and the corresponding matrix of similarities K known as the Gram matrix and defined as 


K = 


/C(*i,*i) IC(x 1/ x 2 ) 

tC{x 2 ,x i) IC(x 2 ,x 2 ) 


K,{xi,x n ) 
lC{x 2 , Xn) 


tC{x n ,xf) )C(x n ,x 2 ) 


K,(x n ,X n ) 


Crucial to most operations like kernel PCA is the centered version of the Gram matrix given by 
K—{I n — I n )K(I n - I n ) =K- I n K - KI U + InKln, 


where I n £ R nxn and I n E R nxn are both n x n matrices defined as 



' i i 

■ i ■ 


"10- 

■ 0 ' 

„ 1 
In — — 

i i ■ 

■ i 

and I n = 

0 1 ■ 

■ 0 

n 

i i 

■ i 

.00- 

• 1 


The next step is to solve the eigenvalue problem 

1 - 

—Kvj — A jVj, 
n 


where V[ E R n and A; E R for i — 1, ■ • • ,n. In matrix form, the eigenvalue problem is 

-K = VAV J . 

n 


In fact, basic PCA can be formulated in kernel form using the Euclidean inner product kernel 
/C(x,, Xj) = (xj, x ( ) = xjxj, sometimes referred to as the vanilla kernel. If we center the data, i.e, 
such that Y^i=\ x ij — 0/ then the Gram matrix is 


X J X 1 

* 7*2 

■ *7 Xn 

X 2 X 1 

* 7*2 ■■ 

■ * 7*1 1 

x 7i x 1 

* 7*2 ■ ■ 

■ *7 Xn 


- XX T . 


Now, the covariance matrix is C = ^ Y!i=\ x i x J = yyA ' X, and PCA based on the covariance is 
simply j;X T Xwj = A jWj for j — ,p with Wj E R p and \j E R. 


Aggregation and the Appeal of Ensemble Learning 

It is often common in massive data that selecting a single model does not lead the optimal 
prediction. For instance, in the presence of multicollinearity which is almost inevitable when p 
is very large, function estimators are typically unstable and of large variance. The now popular 
bootstrap aggregating also referred to as bagging offers one way to reduce the variance of the 
estimator by creating an aggregation of bootstrapped versions of the base estimator. This is an 
example of ensemble learning, with the aggregation/combination formed from equally weighted 
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base learners. 

Bagging Regressors: Let g :b) (■) be the bth bootstrap replication of the estimated base regression 
function g(-). Then the bagged version of the estimator is given by 

g(bagging) (x)= 1 £g(b) (x) . 

B b =1 

If the base learner is a multiple linear regression model estimator g(x) = /3 q + x /l, then the bth 
bootstrapped replicate is g( b )(x) = j3g + x /3 ,: ’h and the bagged version is 



Bagging classifiers: Consider a multi-class classification task with labels y coming from y — 
{1,2, • ■ ■ ,m} and predictor variables x = (x-],X 2 , ■ ■ ■ ,x (? ) 1 coming from a ^-dimensional space 
X. Let g' b) (■) be the bth bootstrap replication of the estimated base classifier g(-), such that 
(y) (fc) = g (b) 0) is the bth bootstrap estimated class of x. The estimated response by bagging is 
obtained using the majority vote rule, which means the most frequent label throughout the B 
bootstrap replications. Namely, g( ba SS ln s) (x) = Most frequent label in C^ B )(x), where 

c( B) (x) = |g (1) (x),g (2) (x),-- ■ ,g (B) (x)|. 

Succinctly, we can write the bagged label of x as 

g(bagging) (x) = argmax {freq^fr)} = argmax |g (l {y=t ( b )(x)}) | ■ 

It must be emphasized that in general, ensembles do not assign equal weights to base learners in 
the aggregation. The general formulation in the context of regression for instance is 

g (agg) (x) = fV b) g (b) (x). 

6=1 

where the aggregation is often convex, i.e. Ylb =1 — 1- 


Parallelization 

When the computational complexity for building the base learner is high, using ensemble learn¬ 
ing techniques like bagging becomes very inefficient, sometimes to the point of being impractical. 
One way around this difficulty is the use of parallel computation. In recent years, both R and 
Matlab have offered the capacity to parallelize operations. Big data analytics will increasingly 
need parallelization as a way to speed up computations or sometimes just make it possible to 
handle massive data that cannot fit into a single computer memory. 
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Regularization and the Power of Prior Information 

All statistical machine learning problems are inherently inverse problems, in the sense that learn¬ 
ing methods seek to optimally estimate an unknown generating function using empirical obser¬ 
vations assumed to be generated by it. As a result statistical machine learning problems are 
inherently ill-posed, in the sense that they typically violate at least one of Hadamard's three well- 
posedness conditions. For clarity, according to Hadamard a problem is well-posed if it fulfills the 
following three conditions: (a) A solution exists; (b) The solution is unique; (c) The solution is stable, 
i.e does not change drastically under small perturbations. For many machine learning problems, the 
first condition of well-posedness, namely existence, is fulfilled. However, the solution is either 
not unique or not stable. With large p small n for instance, not only is there a multiplicity of solu¬ 
tions but also the instability thereof, due to the singularities resulting from the fact that n <§<C p. 
Typically, the regularization framework is used to isolate a feasible and optimal (in some sense) 
solution. Tikhonov's regularization is the one most commonly resorted to, and typically amounts 
to a Lagrangian formulation of a constrained version of the initial problem, the constraints being 
the devices/objects used to isolate a unique and stable solution. 


IV. Statistical Machine Learning Methods for Massive Data 

We consider the traditional supervised learning task of pattern recognition with the goal of esti¬ 
mating a function / that maps an input space X to a set of labels y. We consider the symmetric 
zero-loss £(Y,f(X)) = 1 / Y yf(x))> and the corresponding theoretical risk function 

R(f) = E[l(Y,f(X))} = [ £(y,f(x))dP(x,y) = Pr[Y ^ /(X)]. 

JXxy 

Ideally, one would like to find the universally best classifier f* that minimizes the rate R{f) of 
misclassification, i.e.. 


/* = argmin|lE[/(Y,/(X))]j = argminj Pr[Y ^/(X)]j. 

It is impossible in practice to find /*, because that would require knowing the joint distribution 
of (X, Y) which is usually unknown. In a sense, R{f), the theoretical risk, serves as a standard 
only and helps establish some important theoretical results in pattern recognition. For instance, 
although in most practical problems one cannot effectively compute it, it has been shown theo¬ 
retically that the universally best classifier f* is the so-called Bayes classifier, the one obtained 
through the Bayes' formula by computing the posterior probability of class membership as the 
discriminant function, namely, 

f 7r,p(x|y = j )) 

f*(x) = class* (x) = argmax {Pr[Y = /1 x] } = argmax < —- f— - >. 

je{i,-,g} ;e{i,-,g}l V ( x ) J 

Assuming multivariate Gaussian class conditional densities with common covariance matrix X 
and mean vectors po and p \, the Bayes Risk, that is the risk associated to the Bayes classifier, is 
given by R{f*) — R* = <&(—\/A/2) where <!>(■) is the standard normal cdf and 

A = {pi - PoV^- 1 [pi -p 0 ). 
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Once again, it is important to recall that this R* is not knowable in practice, and what typically 
happens is that, instead of seeking to minimize the theoretical risk R(f), experimenters focus 
on minimizing its empirical counterpart, known as the empirical risk. Given an i.i.d sample 
V — {(xi,i/i), (x 2 , 1 / 2 ) ■ ■ ■ , (x n ,y n )}, the corresponding empirical risk is given by 

£(/) = \ E 

n i—\ 


which is simply the observed (empirical) misclassification rate. It is re-assuring to know that 
fundamental results in statistical learning theory (See Vapnik ( 2000 1) establish that as the sample 
size goes to infinity, the empirical risk mimics the theoretical risk 


limPr[|R(/)-R(/)|< e ]=l. 

n—> 00 


From a practical perspective, this means that the empirical risk provides a tangible way to search 
the space of possible classifiers. Another crucial point is the emphasis on the fact that even with 
this empirical risk, we still cannot feasibly search the universally best function, for such a space 
would be formidably large. That's where the need to choose a particular function class arises. 
In other words, instead of seeking an elusive universally best classifier, one simply proposes a 
plausible classifier, possibly based on aspects of the data, then finds the empirical risk minimizer 
in that space, and then, if the need arises maybe theoretically find out how the associated risk 
compares to the Bayes risk. One of the fundamental results in statistical learning theory has to do 
with the fact the minimizer of the empirical risk could turn out to be overly optimistic, and lead 
to poor generalization performance. It is indeed the case, that by making our estimated classifier 
very complex, it can adapt too well to the data at hand, meaning very low in-sample error rate, 
but yield very high out of sample error rates, due to overfitting, the estimated classifier having 
learned both the signal and the noise. In technical terms, this is referred to as the bias-variance 
dilemma, in the sense that by increasing the complexity of the estimated classifier, the bias is 
reduced (good fit all the way to the point of overfitting) but the variance of that estimator is 
increased. On the other hand, considering much simpler estimators, leads to less variance but 
higher bias (due to underfitting, model not rich enough to fit the data well). This phenomenon 
of bias variance dilemma, is particularly potent with massive data when the number of predictor 
variables p is much larger than the sample size n. One of the main tools in the modern machine 
learning arsenal for dealing with this is the so-called regularization framework whereby instead 
of using the empirical risk alone, a constrained version of it, also known as the regularized or 
penalized version is used. 


Rreg(/) = R(/) + A||/||« 


1 

n 


n 


E 1 {y,^/(x i )} + A \\f\\n> 


where A is referred to as the tuning (regularization) parameter, and ||/||% is some measure of the 
complexity of / with the class T~L from which it is chosen. It makes sense that choosing a function 
/ with a smaller value of ||/||% helps avoid overfitting. The value of A G [0,+ 00 ), controls the 
trade-off between bias (goodness of fit), and function complexity (which is responsibility for 
variance). Practically though, it may still be hard to even explore the theoretical properties of 
a given classifier and compare it to the Bayes risk, precisely because, methods typically do not 
directly act on the zero-one loss function, but instead use at best surrogates of it. Indeed, within a 
selected class T-L of potential classifiers, one typically chooses some loss function £(■,■) with some 
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desirable properties like smoothness and/or convexity (this is because one needs at least to be 
able to build the desired classifier), and then finds the minimizer of its regularized version, i.e., 

“ i=l 


Note that A stills controls the bias-variance trade-off as before. Now, since the loss function 
typically chosen is not the zero-one loss on which the Bayes classifier (universally best) is based, 
there is no guarantee that the best in the selected class TL under the chosen loss function £(■,■) 
will mimic /*. As a matter of fact, each optimal classifier from a given class TL will typically 
perform well if the data at hand and the generator from which ut came, somewhat accord with 
the properties of the space TL. This remark is probably what prompted the famous so-called no 
free lunch theorem, herein stated informally 

Theorem 1. (No Free Lunch) There is no learning method that is universally superior to all other 
methods on all datasets. In other words, if a learning method is presented with a data set zvhose inherent 
patterns violate its assumptions, then that learning method will under-perform. 


The above no free lunch theorem basically says that there is no such thing as a universally su¬ 
perior learning method that outperforms all other methods on all possible data, no matter how 
sophisticated the method may appear to be. Indeed, it is very humbling to see that some of the 
methods deemed somewhat simple sometimes hugely outperform the most sophisticated ones 
when compared on the basis of average out of sample (test) error. It is common practice in data 
mining and machine learning in general, to compare methods based on benchmark data, and 
empirical counterparts of the theoretical predictive measures, often computed using some form 
of re-sampling tool like the bootstrap or cross-validation. Massive data, also known as big data, 
come in various types, shapes, forms and sizes. The specific statistical machine learning tech¬ 
nique used to handle a particular massive data set depends on which aspect of the taxonomy it 
falls in. Indeed, it is important to emphasize that the no free lunch theorem applies more potently 
here, in the sense that there is no panacea that universally applies to all massive data sets. It is 
important however to quickly stress the fact that simplicity in the sense of Ockham's razor non 
plurality principle of parsimony tends to reign supreme when it comes to massive data. In this 
paper, we propose a rough idea of a possible taxomony of massive data, along with some of the 
most commonly used tools for handling each particular class of massiveness. In this paper, we 
consider a few datasets of different types of massiveness, and we demonstrate through compu¬ 
tational results that the no free lunch applies as a stronger as ever. We typically consider some 
of the most commonly used pattern recognition techniques, from those that are most simple and 
intuitive to some that are considered sophisticated and state-of-the-art, and we show that the 
performances vary sometimes drastically from data to data. It turns out, as we will show, that 
depending on the type of massiveness, some methods cannot even be used. We also provide our 
taxonomy of massive ness along with different approaches to dealing with each case. See lVapnik 
(l200d) and Iciuo et al. 1 20051) 


Linear Discriminant Analysis 

Under the assumption of multivariate normality of the class conditional densities with equal 
covariance matrices, namely (x|y = j) ~ MVN(yy, E), or specifically, 

P(X|y = j) = ( 2 7t)P^|E| 1 /2 eX P { ~ ^' )T5: “ 1(X “ */)} ' 
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a classifier with excellent desirable properties is Linear Discriminant Analysis (LDA) classifier, 
which, given any new point x, will estimate the corresponding class Y as 


TlDA = /lDa(x) = 1 


{/3 0 +/3 T x>0} 


l{<5l(x)><5 0 (x)} 


where 


or 


Sj(\) = x E fij - -}ij E jij + log 7ij 

P = -po) and /5o =-^(/h+#o) T £ _1 (/h “Po)+log(^- 

Z 7Io 


where using the indicators z,y = I (y, = j) — 1{ . ■}, the estimated prior probabilities of class 
membership is given by fry = (rij/n) = (1/m) E” =1 z,y, the sample mean vectors p ; are given by 
fij = (1/rij) E" =1 ZijXi and the sample covariance matrices Sy = (l/(ny — 1)) E” =1 z;y(x,- — fij)^ — /<y) T , 
so that the sample pooled covariance 


E 


1 n 


(i/«) EE z m( x ; “/*/) ( x *~ 

;=0 i=l 


P;) T 


The estimation of E is clearly central to the use of Linear Discriminant Analysis. However, when 
n p, E is ill-defined at best. One of the approaches to dealing with this drawback is the 
use of regularization which essentially consists of adding a jitter to the diagonal of E. For a G 
(0,1) or alternatively for A G (0, oo), the regularized version E, namely E ree can be obtained 


E r eg = (1 — a)E + txl or E reg = E + At or even E reg = AE + I. Friedman] <Tl989h p rovides one 


- \ i ll v A. 

of the earliest full treatment of this approach to singularity in LDA. Guo et al. ( 120050 give a 
compelling application of this kind of regularization to microarray data. 


very 


Nearest Neighbors Methods 

The so-called /c-Nearest Neighbors approach to learning computes the estimated response Y/ NN 
of a new point x* as follows: first set the neighborhood size k; then choose a distance measure 
d(-, ■) defined on X x X ; then compute d* = d(x*, x,), i — 1, ■ ■ ■ ,n, then rank all the distances 
d* in increasing order d*^ < d * 2 j < • • • < d*^ < < • • • < d* n y Then specify 14(x*) = {x,- : 

d(x*,Xj) < d * k .}, the size k neighborhood of x*, made up of the k points in T> that are closest to 
x* according to d{-, ■). The estimate response is the Most frequent label in 14(x*), specifically 


IkNN = /km(x*) = argmax |p W (x*)} (2) 

;e(L 1 1 J 

where 

k i=i 


(k) / 

estimates the probability that x* belongs to class j based on 14(x*). Indeed, pj (x* j can be 
thought of as a rough estimate of 7 ty(x*) — Pr[Y* = y |x*], the posterior probability of class 
membership of x*,ie p'- ki (x*) =« 7ry(x*). See lHastie et ali d200lt) and IClarke et al. 1 2009 1). kN- 
earest Neighbors (kNN) essentially performs classification by voting for the most popular re- 
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sponse among the k nearest neighbors of x*. In a sense, kNN provides the most basic form of 
nonparametric classification Thanks to the fact that the estimated response Y£ NN for x* is - at 
least - a crude nonparametric estimator of Bayes classifier's response, which somewhat justifies 
(or at least explains) the great interest in kNN. Typical distances used in practice include the 

Euclidean distance d(x,-, xy) = ||x,- — Xy 11 2 = (X^ = \ ( x u ~ x jt) ) / and the Manhattan distance 

= ||x; — xy||x = Y^ = i \ x ie — x ;/|. One of the appeals of the k-Nearest Neighbors approach 
in both classification and regression lies in the fact it seamlessly applied to data of any type, as 
long as a valid and appropriate distance can be defined. For instance, with binary data where 
each x ( = (x,|, • ■ ■ , x,-p) £ {0,1}P is a ^-dimensional vector of binary (indicator) values, one could 
use the Hamming distance or the very useful Jaccard distance defined as dj(x u Xj) — 1 — /(x„ Xj), 
where, J(xj,Xj) is the Jaccard similarity index, defined by 


xjxi + xjxj-xjxf 

Clearly, |/(x,-, xy)| < 1, with maximum value of 1 attained at /(x,, x,-) and minimum value of 0 
corresponding to two vectors with no matching 1 values. Since the fundamental building block 
of kNN is the distance measure, one can easily perform classification beyond the traditional setting 
where the predictors are numeric. For instance, classification with kNN can be readily performed 
on indicator attributes x, = (x, |, ■ ■ ■ ,x,y,) T £ {0, 1 }P . kNN classifiers are inherently naturally multi¬ 
class, and are used extensively in applications such as image processing, character recognition 
and general pattern recognition tasks. 

When it so happens as it often does that the input space X is nonhomogeneous, a typical ap¬ 
proach to nearest neighbors pattern recognition consists of defining a distance that is the direct 
or convex sum of the type-specific variables. For instance, if there is a group of numeric variables 
and a group of categorical variables, one could use the distance d\{-, ■) for the first group, and 
di(-, ■) for the second group, and then either use the direct sum d(xj,Xj) — di(x;,xy) + d 2 (xj,xy) 
or the convex sum d(x.j,Xj ) = adi(x,-,xy) + (1 — oc)d 2 (xj,Xj) for some a £ (0,1). 

Nearest Neighbors Algorithms are extremely popular in Statistical Data Mining and Machine 
Learning probably due to the fact that they offer the simplest and most flexible form of nonpara¬ 
metric predictive analytics. In this paper, we explore various aspects of predictive analytics in 
regression and pattern recognition (classification) using kNearest Neighbors. The complexity of 
the underlying pattern in k-Nearest Neighbors Analysis is controlled by the size k of the neighbor¬ 
hood. We show how Cross validation and other forms of re-sampling can be used to determine 
the optimal value k that helps achieve bias-variance trade-off and thereby good generalization 
(low prediction error). 


J(Xi/X/) = 


Support Vector Machines 


Consider a response variable taking values in { —1,+1} and a regularized empirical risk func¬ 
tional given by 

£ reg (/3o,£) = \ t (! - yf(j8 0 + jS T x,))+ + 1/3, 1 2 , 

n z—1 Z ;=0 
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where A e R+ is the regularization (tuning) hyperparameter, and the function (w) + — max(0 ,u) 
is used to define the hinge loss 


%;//( x 0) = (i -y/(^o + /3 T x,))+ = 

Solving the optimization problem 


0 if l-y;0S o + /5 T x;) < 0 

1 - Viifio + x i) if 1 - y;(/3o + /? T x/) > 0 


(A>/3 T ) T = argmin j ^ £ (1 “ + £ T X;)) + + y E 10/1 

(/3 0 ,/3 t ) t gRP+ 1 l n i=l z ;=0 


yields the linear support vector machine (SVM) binary classifier whose predicted response is 

/svm(x) = sign (0„ + 0 T x) 


The above solution assumes that the decision boundary between the two classes is a hyperplane. 
However, it often happens that the decision boundary is not linear. Kernelization is the standard 
approach to handling the nonlinearity of the decision boundary in support vector machine clas¬ 
sification. The kernel version of the SVM classifier is given by 


/svm(x) = sign ( J^yi&iK(x,Xi) + 0o 


where fC(-, ■) is a bivariate function called kernel defined on X x X, and used to measure the 
similarity between two points in observation space. The a,'s are determined via quadratic pro¬ 
gramming optimization, with the nonzero a,'s corresponding to the so-called support vectors. 
One of the most general and most commonly used kernel in the context of pattern recognition is 
the Gaussian Radial Basis Function kernel given by 


£(x ; ,x) = exp - 


1 llXj-X, 


■j\\2 


The second most commonly used kernel is the Laplace radial basis function kernel defined as 


tC(x„Xj) = exp I -- 


1 ||x/ — X,-||l 


There are many other kernels like t he polynom ial ker nel /C(x,, x,) = (scale(x,-, x,) + of f set) degree , 
and others, see Iciarke et al. ( 2009 1. Bousquei 120031) . [Tipping! ( 2001 ). Vapnik (2000 ) among others. 
When it so happens that the input space is nonhomogeneous, one could remedy by defined a 
hybrid that is a linear convex combination of other dat-type specific kernels, namely 


K.(xj,Xj) — aK,\{xi,Xj) + (1 — oc)K,2{ x i,Xj) 

It can therefore be said that the Support Vector Machine has the potential to use massive data 
tools like Regularization, Kernelization, Hybridization. In fact, we will see later that even aggre¬ 
gation/combination will be using in both the bagging and boosting SVM classifiers. 
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Logistic Regression 

Arguably one of the most widely used pattern recognition machines, logistic regression is indeed 
used in the context of massive data with both regularization and kernelization frequently resorted 
to in a variety of scenarios. Using the traditional {0,1} indicator labelling, the empirical risk for 
the binary linear logistic regression model is given by 


Hfa'P) = - jE{y;(/3o + /3 T Xi)-log l + exp {(^ 0 + ^ T Xi)}]}. 

1 i=1 


However, if we use the labelling { — 1, +1} as with SVM, the empirical risk for the linear logistic 
regression model is given by 


R(/3 o,/S) 


■ £ log [l + exp { -iji(f 3 0 + /3 T x ; 
i=l 


One of the most common extensions of this basic formulation of logistic regression is the use 
of the regularized version of the empirical risk with i\ norm on the space of the coefficients 
j6y's. This so-called LASSO logistic regression achieves both shrinkage and variable selection in 
situations where the data contains redundant and/or noise variables. 


R-regifiorfi) ~ — V, l°g 

n i=1 


+ ex p{— yi(j8 0 + /8 T Xf)} 


-AE 1/3/1 

;=o 


An even more powerful extension of the logistic regression model comes with the use of kernels 
to capture the nonlinearity of underlying decision boundary of the classifier. Using kernels as 
defined earlier, the regularized empirical risk for the kernel logistic regression is given by 

£reg (g) = \ E lo S i 1 + ex P {-y;£(*;)}] + j\\g\\n K 

n i—\ 


where 

n 

g(xi) =v + Y^w j K.(x ir x j ). 

M 

with g — v + h, v £ R and h G H/c- Here, 'Hk; is the Reproducing Kernel Hilbert Space (RKHS) 
engendered by the kernel tC. The predicted class (label) of x via KLR is given by 

/KL “ W = 3ig " (l + expl-|(x)} - l) ' 

One of the main advantages of KLR over SVM lies in the fact that KLR unlike SVM provides both 
the predicted label (hard classification) and the probability (soft classification) thereof, while SVM 
is inherently built to provide only the predicted label. KLR also extends naturally to multi-class 
while SVM requires more complicated modelling to extend beyond binary classification. 

Classification and Regression Trees Learning 

Understanding trees is indeed straightforward as they are intuitively appealing piecewise func¬ 
tions operating on a partitioning of the input space. Given V — \ (x\, Y|), ■ ■ ■ , (x„, Y n ) L with 
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x; e X, Yj G {1/''' ,g}- If T denotes the tree represented by the partitioning of X into q regions 
Rl, Rh ' ' ' , Rq such that X = u‘j_ j Rf, then, all the observations in a given terminal node (region) 
will be assigned the same label, namely 

c l = argmax \ -J- ^ I(Yj = ;) i 

As a result, for a new point x, its predicted class is given by 


<7 

Lrree = /Tree(x) — CgXgixf 

£=1 


where I^(■) is the indicator function of R/, i.e. (x) — 1 if x G R f and I,;(x) = 0 if x ^ R/ ; . Trees 
are known to be notoriously unstable. Methods like bagging described earlier are often applied 
to trees to help reduce the variance of the tree estimator. 


V. Predictive Performance Comparison of Learning Machines on some 

Massive Datasets 

When we are given a benchmark test set {(xi,yi), (x 2 ,y 2 ), • ■ • , (x m ,y m )}, we can assess the gen- 
eralizability (predictive strength) of a regression function / using the Empirical Prediction Error 
(EPE) defined as 

-i m 

EPE (/) = -E%/'/( x ;))- 

;=i 

In the k-Neatest Neighbors context, we consider S/ c = {k min , ■ ■ ■ ,/c max }, the set of possible values 
of k, and the optimal k can then be estimated as 

£(° p t) _ argmin|EPE(/ kNN )|. 

It's often the case in practice that the training set is the only dataset available. In such cases, the 
training set is subsampled. Typically, the data set is split into training set and test set, and many 
realizations of the EPE are computed over many replications of the split. Let f j be a regression 

estimator and let f - ; be its r-th replication based on the rth split of the data into training and 
test sets. Now, let 

Er = EPE(^ r) ) 

be the rth replication of the test Empirical Predictive Mean Squared Error based on the test por¬ 
tion of the split. Then we have Ei, E 2 , ■ ■ • , Er, and can perform all kinds of statistical analyses on 
these numbers in order to gain deeper insights in the predictive strength of fj. For instance, given 
different competing estimators f\, fir ■ ■ ,/s, we can plot comparative boxplots to assess virtually 
which of the estimators has the best performance. We could also simply compute measures of 
central tendency and measures of spread on these empirical predictive measures. Clearly, this 
gives us a potent framework that can be used for determining the predictively optimal value of k 
when using the kNearest Neighbors Algorithm. Indeed, if we consider each value of k as defin¬ 
ing a different regression estimator, we can then compare them using EPE and indeed choose the 
value of k at which EPE is minimized. 
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A more commonly used approach involving a systematic subsampling as opposed to random 
(stochastic) subsampling is provided by the ubiquitous cross validation tool. To choose the opti¬ 
mal number of neighbors by leave one out cross validation (LOOCV), one computes 

^.( opt ) = argmin j CV (/ kNN ) j , 

where — {/c min , ■ ■ ■ ; /c max } is the set of possible values of k, and 


CV(/km) — — F. £(yi' 

n i=l 


1 - 


with / kM denoting an instance of the estimator / k uN obtained without the zth observation. 
Once the "optimal" value of k is estimated using one of the above approach, we compare the 
predictive performance of the k-Nearest Neighbors estimates to the estimates produced by other 
function estimators. 

Our first data set the Musk data set, with n — 476 and p — 166 which will fall in category E of 
our taxonomy, since n < 1000 and 1 < n/p < 10. 
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Figure 1: Comparison of the average prediction error over R = 100 replications on the Musk data 
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Figure 2: Comparison of the average prediction error over R = 100 replications on the Musk data 


VI. Conclusion and discussion 

Throughout this paper, we have attempted to provide a detailed account of various aspects of big 
data. The taxonomy provided here is neither unique nor exhaustive, but we hope to have sown 
the seed for a more rigorous debate on what really does constitute big. In the interest of concision, 
we left out the computational exploration of data in the case of large p small n because such an 
exploration constitutes the subject of lengthy papers in their own right. One thing that seems to 
emerge in our attempt to define and develop a taxonomy of big data is the so called No free Lunch 
theorem mentioned earlier. As our simulations show, no technique appears to be universally the 
best on all data scenarios. Commonsense should lead us to realize that any attempt a deriving a 
universally superior technique naturally leads to unmanageably complex models that may well 
turn to be unusable at best and even inferior in performance (due to their complexity) at worst. 
We plan to explore in greater depth aspects of the taxonomy that we only scratched here, namely 
diving into deeper insights on model tuning in the context of large p small n, the challenges of 
efficient and effective parallelization in the context of distributed machine learning. 
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